Seeking help with the right syntax to examine the collinearity between two categorical variables in Stata

Emerald Chang

Join Date: Sep 2017

Posts: 50
#1

08 Jul 2018, 21:08

Hi everyone,

I am looking for the right syntax to examine the collinearity between two categorical variables in Stata.

I would like to include two variables in my multivariate regression model: mode of delivery (1. non-labour, 2. intrapartum, 3. vaginal delivery) and time of placental collection (1. <=30 min, 2. 31 to 60 min, 3. 61 to 90 min, 4. 91 min and above). However, we suspect that there might be collinearity between them (mode of delivery is a nominal variable; time of collection is an ordinal variable).

Since "corr" applies only to continuous variables, I wonder whether VIF is the right tool for examining the collinearity between these two categorical variables. I also tried "vce, corr".

Also, how do I interpret the results shown below? I understand that 10 is the commonly recommended cut-off for VIF, so can I confidently say that there is only weak collinearity between MOD and time of placental collection? (Please refer to the VIF results from Stata below.)

Thank you and truly appreciate your assistance in this matter. ; )

Commands I used to perform VIF:

reg z_MI mo_age i.mo_Chinese pp_bmi i.parity_2 i.tobacco_2 GA_weeks i.child_sex i.new_MOD i.time_2 ogtt_2hour

vif

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      mo_age |      1.30    0.766470
2.mo_chinese |      1.21    0.823911
      pp_bmi |      1.16    0.864906
  2.parity_2 |      1.24    0.803848
 1.tobacco_2 |      1.08    0.927086
    GA_weeks |      1.07    0.933429
 2.child_sex |      1.03    0.972162
     new_MOD |
          2  |      1.60    0.623646
          3  |      1.57    0.635354
      time_2 |
          2  |      2.29    0.436235
          3  |      1.99    0.501730
          4  |      1.63    0.613657
  ogtt_2hour |      1.10    0.907737
-------------+----------------------
    Mean VIF |      1.41
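
As a reading aid for the two numeric columns in the VIF output: they carry the same information. The relation below assumes the standard definition of the variance inflation factor; the symbol R_j^2 is introduced here (it is not in the post) for the R-squared from regressing predictor j on all the other predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\qquad\Longrightarrow\qquad
\frac{1}{\mathrm{VIF}_j} = 1 - R_j^2
```

For the largest value reported (level 2 of time_2), 1/VIF = 0.436235, so R_j^2 is about 0.564 and VIF = 1/0.436235, which is about 2.29.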

Clyde Schechter

Join Date: Apr 2014

Posts: 30014
#2

08 Jul 2018, 22:24

Multicollinearity is, in my opinion, the most over-rated problem in all of statistics. If you have the time, find Goldberger's econometrics textbook and read the chapter he wrote denouncing it. I endorse his views on it.

It is even more problematic when we are talking about categorical variables. By their very construction, there will be multicollinearity among the indicators (dummies) for each variable, even if there is little collinearity across the categories. So VIF results become difficult to interpret in this situation. The most direct way to determine the extent to which these two variables are associated with each other is just to cross-tabulate them: -tab time_2 new_MOD-.

But even that assumes that there is a reason to look into collinearity here. There isn't. Multicollinearity is to some degree present in nearly all observational data sets. The effect of multicollinearity is to decrease the precision with which the coefficients of the variables involved can be estimated. In the worst case of a nearly perfect collinear relationship between two variables, the two coefficients will come with huge standard errors and their 95% confidence intervals will be so wide that it is patently obvious that the results are uninformative. In the usual situation, the effects are mild to moderate.

So rather than trying to test whether or not you have multicollinearity (you do--I can tell you that without knowing anything about your data beyond that it is observational) or quantify it, look to see if you have a collinearity problem. The way to see that is by looking at the confidence intervals for the coefficients that are of importance to your research goals. From your description of the problem, it sounds like time and mode of delivery are both important variables in your project--not just something thrown in to adjust ("control") for. So you need reasonably precise estimates of their coefficients. So look at those confidence intervals. Are they narrow enough that, for practical purposes, you can reasonably answer your research questions about these variables? If so, there is no problem and you should not waste another second thinking about multicollinearity. If, however, some or all of the confidence intervals are too wide to support answers to your research questions, then you do have a multicollinearity problem.
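
A minimal Stata sketch of the check described above, assuming the model and variable names from post #1 (they are copied from that post as typed, not verified against the data). After -regress-, the last two columns of the coefficient table are the 95% confidence limits. The -lincom- line is an added illustration: it gives an interval for a comparison the table does not show directly, here level 3 versus level 2 of new_MOD.

```stata
* Model from post #1; variable names are copied from that post (assumed to exist as typed).
* The /// line continuation requires running this from a do-file.
regress z_MI mo_age i.mo_Chinese pp_bmi i.parity_2 i.tobacco_2 GA_weeks ///
    i.child_sex i.new_MOD i.time_2 ogtt_2hour

* Typing -regress- with no arguments redisplays the coefficient table;
* the columns headed [95% Conf. Interval] are the ones to judge for width.
regress

* Interval for a contrast between two non-reference levels of new_MOD
* (level 3 minus level 2), which is not in the table itself.
lincom 3.new_MOD - 2.new_MOD
```

Whether an interval counts as "narrow enough" is a subject-matter judgment about the outcome's scale; the commands only put the intervals in front of you.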

Now, here's the really bad news. If you do have a multicollinearity problem, there probably isn't anything you can do about it. At least nothing you can do easily. Here are the options, all of them difficult or unsatisfactory:

1. Drop one of the variables (or in the case of a multi-level category variable, combine some categories). Given that you have identified these variables as important to the research goals, eliminating or redefining them is clearly undesirable.

2. Get a much larger data set. Goldberger says that multicollinearity should be called hyponumerosity. And he is right. It means that for the level of association among the variables, the data set is too small to separate their different effects. The problem is that the data set typically has to be one or two orders of magnitude bigger than what you have, which, in most settings, is simply not feasible.

3. Scrap your current design and collect a new data set using a sampling scheme that breaks the associations between these variables. This means some sort of matching or stratification scheme that will assure that the distributions of time_2 and new_MOD are essentially independent in your sample, even though at the population level they are strongly correlated. This is probably the best of the three approaches, but it really entails starting over from square one, which is demoralizing and usually prohibitively expensive.

To recapitulate: don't waste time thinking about multicollinearity or trying to quantify it. Just look at your regression output to see if the confidence intervals around the coefficients of the important variables are narrow enough to be useful for answering your research questions. If they are, you are home free. If they are not, you are probably up the creek without a paddle.
Emerald Chang

Join Date: Sep 2017

Posts: 50
#3

09 Jul 2018, 01:00

Hi Clyde,

Many thanks for your reply. I truly appreciate your help.

I just had a look at the output for those two variables in the multivariate regression. The 95% CIs around their coefficients seem narrow enough to me (please refer to Output 1). So, presumably, they are useful enough for me to answer my research question?

Also, regarding the command -tab time_2 new_MOD- that you suggested above, I presume it has something to do with the chi-squared test?

I ran a chi-squared test as well, and the p-value is less than 0.001 (Output 2), so these two variables are found to be dependent on one another.

So in this case, should I just choose one of the two variables (MOD or time_2) to include in my regression model?

To my understanding, mode of delivery more or less determines the time within which placental tissue can be collected: C-section gives rise to a shorter time than vaginal delivery.

Output 1

z_MI Coef. Std. Err. t P>|t| [95% Conf. Interval]

new_MOD
2_IN -.1726411 .139012 -1.24 0.215 -.4455527 .1002706
3_VD -.2964244 .0976085 -3.04 0.002 -.4880518 -.104797

time_2
2_31 to 60min -.1415133 .1078543 -1.31 0.190 -.3532553 .0702288
3_61 to 90 min -.2482731 .1244511 -1.99 0.046 -.4925984 -.0039477
4_91 and above -.2663024 .1430007 -1.86 0.063 -.5470447 .0144398

Output 2

MOD MI time collection in 4 groups
1_<=30 2_31 to 6 3_61 to 9 4_91 and above Total

1_NL 12 102 18 8 140
2_IN 11 53 18 9 91
3_VD 87 282 123 72 564

Total 110 437 159 89 795

Pearson chi2(6) = 24.5298 Pr = 0.000

Thank you once again in advance!
Maarten Buis

Join Date: Mar 2014

Posts: 3436
#4

09 Jul 2018, 01:31

A statistical test is not appropriate for deciding whether to remove a variable because of multicollinearity. You need to look at the size of the association, not the statistical significance, and that association has to be really really really unrealistically huge before deciding that multicollinearity is a problem. From your output you can compute Cramér's V (you could have gotten that in the output by adding the V option in tab), which is about 0.12. This is a smallish association, definitely nowhere near enough to make me worried about multicollinearity. So I would just leave both in your model.
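
For readers who want to reproduce the 0.12 figure, here is a short Stata sketch. It assumes the cell counts, the chi-squared value (24.5298) and the sample size (795) exactly as posted in Output 2 of post #3; -tabi- is the immediate form of -tabulate-, so no data set is needed.

```stata
* Re-enter the cell counts from the cross-tabulation in post #3
* (rows: new_MOD levels 1-3; columns: time_2 levels 1-4).
tabi 12 102 18 8 \ 11 53 18 9 \ 87 282 123 72, chi2 V

* The same quantity by hand, using the posted chi2 and N:
* V = sqrt(chi2 / (N * min(rows - 1, columns - 1)))
display sqrt(24.5298 / (795 * min(3 - 1, 4 - 1)))
```

The hand calculation works out to roughly 0.124, in line with the value quoted above.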

Your output is virtually unreadable. The way to solve that is to use CODE delimiters. When you write a message on Statalist, you see in the top right corner of your message a button with an underlined A. Press that, and you'll see more buttons; look for the # button. This will add the delimiters to your message, and between them you can copy your output (please include the command so we know exactly what it is that you are showing us).

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Emerald Chang

Join Date: Sep 2017

Posts: 50
#5

09 Jul 2018, 02:24

Hi Maarten,

Thanks for the tip. I have tried my best to reformat those outputs just now, but it did not turn out well.

I hope that they look clearer this time. Thanks heaps.

. tabulate new_MOD time_2, chi2

mode_vd_in | MI time collection in 4 groups
_nl | 1_<=30 2_31 to 6 3_61 to 9 4_91 and | Total
-----------+--------------------------------------------+----------
1_NL | 12 102 18 8 | 140
2_IN | 11 53 18 9 | 91
3_VD | 87 282 123 72 | 564
-----------+--------------------------------------------+----------
Total | 110 437 159 89 | 795

Pearson chi2(6) = 24.5298 Pr = 0.000

. tabulate new_MOD time_2, V

mode_vd_in | MI time collection in 4 groups
_nl | 1_<=30 2_31 to 6 3_61 to 9 4_91 and | Total
-----------+--------------------------------------------+----------
1_NL | 12 102 18 8 | 140
2_IN | 11 53 18 9 | 91
3_VD | 87 282 123 72 | 564
-----------+--------------------------------------------+----------
Total | 110 437 159 89 | 795

Cramér's V = 0.1242
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

09 Jul 2018, 05:15

It seems your attempt to use a CODE block was not successful. Let me try a restatement of the technique. One thing is important though - you should copy your output fresh from Stata, not from a previous post that lacked the CODE block, because once it's posted, the spacing is gone.

To assure maximum readability of results that you post, you copy them from the Results window or your log file into a CODE block in the Forum editor, as explained in section 12 of the Statalist FAQ linked to at the top of the page. You can create the CODE tags as Maarten described in post #4, or you can just type them as you would any other content. Regardless of how the CODE tags are created, the following example:

[CODE]
. sysuse auto, clear
(1978 Automobile Data)

. describe make price

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
[/CODE]

will be presented in the post as the following:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. describe make price

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price

You can check your post by using the Preview button next to the Post button before posting it.
Emerald Chang

Join Date: Sep 2017

Posts: 50
#7

09 Jul 2018, 22:02

Hi William,

Thanks for the tip. I did copy results/output from Stata to the CODE block straight away, which was mentioned by Maarten earlier on. But, it seems that it is not just about copy and paste , some adjustments still need to be made before a clear output can be shown.
Okay. Will play around with it and pretty sure I will nail this down in no time :P (fingers crossed).

Thank you all and truly truly appreciate each of your posts above.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

10 Jul 2018, 05:06

"But, it seems that it is not just about copy and paste , some adjustments still need to be made before a clear output can be shown."

That is not correct. No adjustments are needed.

"Will play around with it"

There is a Statalist Forum called "Sandbox" within which you can make posts that people will not respond to. This is a good place to experiment by making actual posts, copying from the Stata Results window and pasting into the post, surrounding the pasted material with CODE tags, as shown in post #6.